Method and apparatus for search engine World Wide Web crawling

ABSTRACT

A technique is provided for efficient search engine crawling. First, optimal crawling frequencies, as well as the theoretically optimal times to crawl each Web page, are determined. This is performed under an extremely general distribution model of Web page updates, one which includes both stochastic and generalized deterministic update patterns. Techniques from the theory of resource allocation problems are used which are extraordinarily computationally efficient, crucial for practicality because the size of the problem in the Web environment is immense. The second part employs these frequencies and ideal crawl times as input, creating an optimal achievable schedule for crawlers. The solution, based on network flow theory, is exact and highly efficient as well.

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application is related to “Method and Apparatus for Web Crawler Data Collection,” by Squillante et al., Attorney Docket No. YOR920030081US1, copending U.S. patent application Ser. No. 10/______, filed herewith, which is incorporated by reference herein in its entirety.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] The present invention relates generally to information searching, and more particularly, to techniques for providing efficient search engine crawling.

[0004] 2. Background of the Invention

[0005] Search engines play a pivotal role on the World Wide Web (“Web”). Every day, millions of people rely on search engines to quickly and accurately retrieve relevant information. Without search engines, surfing the Web would be a nearly impossible task.

[0006] To facilitate searching, search engines often employ crawlers (also called “spiders” or “robots” (“bots”)). A crawler visits Web pages on various Web sites. Information read by a crawler is then used to generate an index from the Web pages that have been read. The index is used by the search engine to return links to pages associated with search terms entered by users.

[0007] Web pages are frequently updated by their owners, sometimes modestly and sometimes significantly. Studies have shown that 23 percent of Web pages change daily, while 40 percent of commercial Web pages change daily. Some Web pages disappear completely, and a half-life of 10 days for Web pages has been observed. Data gathered by a search engine during its crawls can thus quickly become stale, or out of date. As a result, crawlers must regularly revisit Web sites to maintain freshness of the search engine's data.

[0008] Although search engines perform basic functions well, it is still quite common for links to stale Web pages to be returned. For example, search engines frequently return links to Web pages that either no longer exist or which have been changed. It can be very frustrating to click on a link only to find that the result is incorrect, or worse, that the page does not exist.

[0009] Given the importance of returning useful information, it would be desirable and highly advantageous to provide techniques for more efficient search engine crawling that overcome the deficiencies of conventional approaches.

SUMMARY OF THE INVENTION

[0010] The present invention provides techniques for efficient search engine crawling.

[0011] In various embodiments of the present invention, a scheme is provided to determine the optimal crawling frequencies, as well as the theoretically optimal times to crawl each Web page. It does so under an extremely general distribution model of Web page updates, one which includes both stochastic and generalized deterministic update patterns. It uses techniques from the theory of resource allocation problems which are extraordinarily computationally efficient, crucial for practicality because the size of the problem in the Web environment is immense. The second part employs these frequencies and ideal crawl times as input, creating an optimal achievable schedule for crawlers. The solution, based on network flow theory, is exact and highly efficient as well.

[0012] These and other aspects, features and advantages of the present invention will become apparent from the following detailed description of preferred embodiments, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] FIG. 1 is a block diagram illustrating exemplary components of the present invention;

[0014] FIG. 2 is a flow diagram outlining an exemplary technique for efficient search engine crawling;

[0015] FIG. 3 illustrates an exemplary embarrassment-level decision tree, which indicates the way in which weights associated with each Web page can be computed;

[0016] FIG. 4 illustrates a possible graph of probability of clicking on a Web page as a function of its position and page in the search query results returned to a client;

[0017] FIG. 5 illustrates a possible freshness probability function for quasi-deterministic Web pages;

[0018] FIG. 6 is a flow diagram outlining steps involved in one of the key calculations for quasi-deterministic Web pages;

[0019] FIG. 7 is a flow diagram outlining steps involved in solving the Web page allocation problem; and

[0020] FIG. 8 illustrates an exemplary transportation network to provide a crawling schedule.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

[0021] According to various exemplary embodiments of the present invention, a scheme is provided to optimize the search engine crawling process. One reasonable goal is the minimization of the average level of staleness over all Web pages. However, a slightly different metric provides even greater utility. This involves an embarrassment metric, i.e., the frequency with which a client makes a search engine query, clicks on a link returned by the search engine, and then finds that the resulting page is inconsistent with respect to the query. In this context, goodness corresponds to the search engine having a fresh copy of the Web page. However, badness must be partitioned into lucky and unlucky categories: The search engine can be bad but lucky in a variety of ways. In order of increasing luckiness, the possibilities are:

[0022] The Web page might be stale, but not returned to the client as a result of the query;

[0023] The Web page might be stale, returned to the client as a result of the query, but not clicked on by the client; and

[0024] The Web page might be stale, returned to the client as a result of the query, clicked on by the client, but might be correct with respect to the query anyway.

[0025] Thus, the metric under discussion only counts those queries on which the search engine is actually embarrassed. In this case, the Web page is stale, returned to the client, who clicks on the link only to find that the page is either inconsistent with respect to the original query, or (worse yet) has a broken link.

[0026] It is to be understood that the present invention may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof. Preferably, the present invention is implemented as a combination of hardware and software. Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage device. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (CPU), a random access memory (RAM), and input/output (I/O) interface(s). The computer platform also includes an operating system and microinstruction code. The various processes and functions described herein may either be part of the microinstruction code or part of the application program (or a combination thereof) that is executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.

[0027] It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying Figures are preferably implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.

[0028] Referring to FIG. 1, a block diagram illustrating exemplary components of the present invention is shown.

[0029] A crawler optimizer 101 determines an optimal number of crawls for each Web page over a fixed period of time called a scheduling interval, as well as determining the theoretically optimal (ideal) crawl times themselves. These two problems are highly interconnected. The same basic scheme can be used to optimize either the staleness or embarrassment metric. The present invention supports models in which the updates are fully stochastic. Another important model supported by the present invention is motivated by, for example, an information service that updates its Web pages at certain times of the day, if an update to the page is necessary. This case, called quasi-deterministic, covers Web pages whose updates are somewhat more deterministic, in the sense that there are fixed potential times at which updates might or might not occur.

[0030] Web pages with deterministic updates are a special case of the quasi-deterministic model. Furthermore, the crawling frequency problem can be solved under additional constraints which make its solution more practical in the real world. For example, one can impose minimum and maximum bounds on the number of crawls for a given Web page. The latter bound is important because crawling can actually cause performance problems for Web sites.

[0031] The other component of the proposed invention, called a crawler scheduler 102, employs as its input the output from the crawler frequency optimizer 101. (Again, this comprises the optimal numbers of crawls and the ideal crawl times). It then finds an optimal achievable schedule for the crawlers themselves. This part of the invention is based on network flow theory, and can be posed specifically as a transportation problem. Moreover, one can impose additional real-world constraints, such as restricted crawling times for a given Web page.

[0032] 1. Invention Overview

[0033] Denote by N the total number of Web pages to be crawled, which shall be indexed by i. Consider a scheduling interval of length T as a basic atomic unit of decision making. These scheduling intervals repeat every T units of time, and the invention will make decisions about one scheduling interval using both new data and the results from the previous scheduling interval. Let R denote the total number of crawls possible in a single scheduling interval.

[0034] Assume that the time intervals between updates of page i follow an arbitrary distribution function G_(i)(·) with mean λ_(i)⁻¹>0. Suppose Web page i will be crawled a total of x_(i) times during the scheduling interval [0,T] (where x_(i) is a non-negative integer less than or equal to R), and suppose these crawls occur at times 0≦t_(i,1)<t_(i,2)< . . . <t_(i,x_(i))≦T. The invention is based on computing a time-average staleness as: $$a_i(t_{i,1},\ldots,t_{i,x_i}) = \frac{1}{T}\sum_{j=0}^{x_i}\int_{t_{i,j}}^{t_{i,j+1}}\left(1-\lambda_i\int_{0}^{\infty}\bar{G}_i(t-t_{i,j}+v)\,dv\right)dt, \qquad (1)$$

[0035] where {overscore (G)}_(i)(t)≡1−G_(i)(t) is the tail distribution of interupdate times.

[0036] The times t_(i,1), . . . , t_(i,x_(i)) should be chosen so as to minimize the time-average staleness estimate a_(i)(t_(i,1), . . . , t_(i,x_(i))), given that there are x_(i) crawls of page i. Deferring the question of how to find the optimal values t_(i,1)*, . . . , t_(i,x_(i))*, define the function A_(i) by setting

A_(i)(x_(i))=a_(i)(t_(i,1)*, . . . , t_(i,x_(i))*).   (2)

[0037] Thus, the domain of this function A_(i) is the set {0, . . . ,R}.

[0038] While one would like to choose x_(i) as large as possible, there is competition for crawls from other Web pages. Taking all Web pages into account, one goal of the invention therefore is to minimize the objective function $$\sum_{i=1}^{N} w_i A_i(x_i) \qquad (3)$$

[0039] subject to the constraints $$\sum_{i=1}^{N} x_i = R, \qquad (4)$$

x_(i) ∈ {m_(i), . . . , M_(i)}.   (5)

[0040] Here the weights w_(i) will determine the relative importance of each Web page i. The non-negative integers m_(i)≦M_(i) represent the minimum and maximum number of crawls possible for page i. They could be 0 and R respectively, or any values in between. Practical considerations will dictate these choices.

[0041] A complete description of the invention may include the additional steps of:

[0042] Computing the weights w_(i) for each Web page i.

[0043] Computing the functional forms a_(i) and A_(i) for each Web page i.

[0044] Solving the resulting Web page crawler allocation problem in a highly efficient manner.

[0045] Scheduling the crawls in the time interval T.

[0046] Referring to FIG. 2, a flow diagram outlining an exemplary overall technique for efficient search engine crawling is illustrated.

[0047] In step 201, i is initialized to 1. In step 202, the weight w_(i) for Web page i is computed. This step is refined in subsection 2. In step 203, it is determined whether the Web page is fully stochastic (denoted FS) or quasi-deterministic (denoted QD). Then, in either step 204 or step 205, the appropriate computation for A_(i) is accomplished. These steps differ depending on the type of Web page, and are further refined in subsections 3.1 and 3.2, respectively. In step 206, i is incremented, and in step 207 i is tested against N. If i≦N, control returns to step 202; otherwise, it proceeds to step 208, where the Web crawl allocation problem is solved. This step is further refined in subsection 4. In step 209, the Web page crawler scheduling problem is solved. This step is further refined in subsection 5.

[0048] 2. Computing Weights w_(i)

[0049] FIG. 3 illustrates a decision tree tracing the possible results for a client making a search engine query. Keep a particular Web page i in mind, and follow the decision tree down from the root to the leaves. The invention chooses weights which will indicate the level of embarrassment to the search engine.

[0050] The first possibility is for the page to be fresh. In this case, the Web page will not cause embarrassment. So, assume the page is stale. If the page is never returned by the search engine, there again can be no embarrassment. The search engine is lucky in this case. Next, consider what happens if the page is returned. A search engine will typically organize its query responses into multiple result pages, and each of these result pages will contain the URLs of several returned Web pages, in various positions on the page. Let P denote the number of positions on a returned page (which is typically on the order of 10). Note that the position of a returned Web page on a result page reflects the ordered estimate of the search engine for the Web page matching what the user wants. Let b_(i,j,k) denote the probability that the search engine will return page i in position j of query result page k. The search engine can easily estimate these probabilities, either by monitoring all query results or by sampling them for the client queries.

[0051] The search engine can still be lucky even if the Web page i is stale and returned. A client might not click on the page, and thus never have a chance to learn that the page was stale. Let c_(j,k) denote the frequency that a client will click on a returned page in position j of query result page k. These frequencies also can be easily estimated, again either by monitoring or sampling.

[0052] This clicking probability function might look something like FIG. 4. In any case the data can be collected by the search engine.

[0053] Even if the Web page is stale, returned by the search engine, and clicked on, the changes to the page might not cause the results of the query to be wrong. Let d_(i) denote the probability that a query to a stale version of page i yields an incorrect response. Once again, this parameter can be easily estimated.

[0054] Then one can compute the total level of embarrassment caused to the search engine by Web page i as $$w_i = d_i \sum_{j}\sum_{k} c_{j,k}\, b_{i,j,k}. \qquad (6)$$
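Equation (6) can be evaluated directly once the three sets of estimates are available. The following is a minimal Python sketch, not the implementation of the invention; the function name `embarrassment_weight`, the nested-list layout (outer index j for position, inner index k for result page), and all numeric values are assumptions made purely for illustration.

```python
def embarrassment_weight(d_i, click_freq, return_prob):
    """Evaluate w_i = d_i * sum_j sum_k c[j][k] * b_i[j][k], as in equation (6).

    d_i         -- probability that a stale copy of page i gives a wrong result
    click_freq  -- c[j][k], click frequency for position j on result page k
    return_prob -- b_i[j][k], probability page i is returned at (j, k)
    """
    total = 0.0
    for c_row, b_row in zip(click_freq, return_prob):
        for c_jk, b_ijk in zip(c_row, b_row):
            total += c_jk * b_ijk
    return d_i * total


# Hypothetical estimates: 2 positions (j) on each of 2 result pages (k).
c = [[0.5, 0.1], [0.25, 0.05]]
b = [[0.2, 0.0], [0.4, 0.1]]
w = embarrassment_weight(0.5, c, b)
```

With these made-up inputs the double sum is 0.205, so the weight is half of that.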

[0055] 3. Computing the Functions A_(i)

[0056] For concreteness, this aspect of the invention will first be described for G_(i)(·) as exponentially distributed. Those skilled in the art will be able to understand the changes required to handle other distributions. Then the so-called quasi-deterministic case will be described. This case is appropriate for Web pages i in which there are a number of specific times u_(i,n) when the page is updated with probability k_(i,n).

[0057] 3.1 Purely Stochastic Case

[0058] Here the invention computes $$a_i(t_{i,1},\ldots,t_{i,x_i}) = 1 + \frac{1}{\lambda_i T}\sum_{j=0}^{x_i}\left(e^{-\lambda_i (t_{i,j+1}-t_{i,j})} - 1\right). \qquad (7)$$
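As a brief check, equation (7) can be recovered by specializing equation (1) to exponential interupdate times. The derivation below is a sketch under two assumptions that are not stated explicitly in the text: the tail distribution is taken to be \bar{G}_i(s)=e^{-\lambda_i s}, and the crawl intervals are taken to partition [0,T], that is, t_{i,0}=0 and t_{i,x_i+1}=T.

```latex
\begin{align*}
\lambda_i \int_0^\infty \bar{G}_i(t - t_{i,j} + v)\,dv
  &= \lambda_i \int_0^\infty e^{-\lambda_i (t - t_{i,j} + v)}\,dv
   = e^{-\lambda_i (t - t_{i,j})}, \\
\int_{t_{i,j}}^{t_{i,j+1}} \left(1 - e^{-\lambda_i (t - t_{i,j})}\right) dt
  &= (t_{i,j+1} - t_{i,j})
   + \frac{1}{\lambda_i}\left(e^{-\lambda_i (t_{i,j+1} - t_{i,j})} - 1\right), \\
a_i(t_{i,1},\ldots,t_{i,x_i})
  &= \frac{1}{T}\sum_{j=0}^{x_i}\left[(t_{i,j+1} - t_{i,j})
   + \frac{1}{\lambda_i}\left(e^{-\lambda_i (t_{i,j+1} - t_{i,j})} - 1\right)\right] \\
  &= 1 + \frac{1}{\lambda_i T}\sum_{j=0}^{x_i}
     \left(e^{-\lambda_i (t_{i,j+1} - t_{i,j})} - 1\right).
\end{align*}
```

The last step uses the fact that, under the partition assumption, the interval lengths sum to T.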

[0059] Writing T_(i,j) for the interval between consecutive crawls, t_(i,j+1)−t_(i,j), the optimum is known to occur at the value (T_(i,1)*, . . . , T_(i,x_(i))*) where the derivatives are equal. The summands are all identical, and thus the optimal decision variables can be found immediately as T_(i,j)*=T/(x_(i)+1). Hence, the invention computes $$A_i(x_i) = 1 + \frac{x_i+1}{\lambda_i T}\left(e^{-\lambda_i T/(x_i+1)} - 1\right). \qquad (8)$$

[0060] Moreover, for any probability distribution, the optimum is known to occur at the value where the derivatives are equal and the summands are identical.
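The relationship between equations (7) and (8) can be checked numerically. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes the boundary convention t_(i,0)=0 and t_(i,x_(i)+1)=T so that the x_(i)+1 intervals cover [0,T].

```python
import math


def staleness_exponential(crawl_times, lam, T):
    """Equation (7): time-average staleness for exponential interupdate times.

    Assumes the endpoints 0 and T are appended to the crawl times, so the
    sum runs over the x + 1 intervals that cover [0, T].
    """
    pts = [0.0] + sorted(crawl_times) + [T]
    total = sum(math.exp(-lam * (b - a)) - 1.0 for a, b in zip(pts, pts[1:]))
    return 1.0 + total / (lam * T)


def optimal_staleness_exponential(x, lam, T):
    """Equation (8): staleness when the x crawls are equally spaced."""
    gap = T / (x + 1)
    return 1.0 + (x + 1) / (lam * T) * (math.exp(-lam * gap) - 1.0)


# Hypothetical parameters: update rate 1 per unit time, interval length 10.
lam, T, x = 1.0, 10.0, 3
equal = staleness_exponential([2.5, 5.0, 7.5], lam, T)
uneven = staleness_exponential([1.0, 2.0, 3.0], lam, T)
best = optimal_staleness_exponential(x, lam, T)
```

Equally spaced crawls reproduce the closed form of equation (8), and the unevenly spaced schedule in this example yields a larger staleness, as expected from the convexity of the exponential summands.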

[0061] 3.2 Quasi-Deterministic Case

[0062] In this case, there is a deterministic sequence of times 0≦u_(i,1)<u_(i,2)< . . . <u_(i,Q_(i))≦T defining possible updates for page i, together with a sequence {k_(i,1), k_(i,2), . . . , k_(i,Q_(i))} defining the probabilities that the corresponding update actually occurs. Define u_(i,0)≡0 and u_(i,Q_(i))≡T. Those skilled in the art will appreciate that the update pattern is purely deterministic when k_(i,j)=1 for all j ∈ {1, . . . , Q_(i)}.

[0063] A key observation of the present invention is that all crawls should be done at the potential update times, because there is no reason to delay beyond when the update has occurred. This also implies that x_(i)≦Q_(i)+1, as there is no reason to crawl more frequently. Hence, consider the binary decision variables $$y_{i,j} = \begin{cases} 1, & \text{if a crawl occurs at time } u_{i,j}; \\ 0, & \text{otherwise.} \end{cases} \qquad (9)$$

[0064] If there are x_(i) crawls, then Σ_(j=0)^(Q_(i)) y_(i,j)=x_(i).

[0065] Then, the staleness probability function {overscore (p)}(y_(i,0), . . . , y_(i,Q_(i)), t) at an arbitrary time t is computed by the following formula. $$\bar{p}(y_{i,0},\ldots,y_{i,Q_i},t) = 1 - \prod_{j=J_i(t)+1}^{N_i^u(t)} (1-k_{i,j}), \qquad (10)$$

[0066] where a product over the empty set, as per normal convention, is assumed to be 1.

[0067] FIG. 5 illustrates a typical staleness probability function {overscore (p)}. (For visual clarity, the freshness function 1−{overscore (p)} is displayed rather than the staleness function). Here the potential update times are noted by circles on the x-axis. Those which are actually crawled are depicted as filled circles, while those that are not crawled are left unfilled. The freshness function jumps to 1 during each interval immediately to the right of a crawl time, and then decreases, interval by interval, as more terms are multiplied into the product. The function is constant during each interval.
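The step-function behavior described for FIG. 5 can be sketched in Python. The text does not define J_(i)(t) or N_(i)^u(t), so the sketch makes the following assumptions, which should be read as one plausible interpretation rather than the definition used by the invention: N_(i)^u(t) is the index of the latest potential update time at or before t, J_(i)(t) is the index of the latest crawled potential update time at or before t, and the page is treated as fresh at time 0 when no crawl has yet occurred. The function name and the sample data are hypothetical.

```python
def staleness_probability(u, k, y, t):
    """Staleness probability in the spirit of equation (10).

    u -- potential update times, with u[0] = 0
    k -- k[j] is the probability that an update occurs at u[j] (k[0] unused)
    y -- y[j] = 1 if a crawl occurs at u[j], else 0
    t -- query time, with t >= 0
    """
    # Assumed meaning of N^u(t): latest potential update index at or before t.
    n_u = max(j for j in range(len(u)) if u[j] <= t)
    # Assumed meaning of J(t): latest crawled index at or before t
    # (index 0 if nothing has been crawled yet, i.e. fresh at time 0).
    crawled = [j for j in range(n_u + 1) if y[j] == 1]
    last_crawl = max(crawled) if crawled else 0
    fresh = 1.0  # an empty product is 1
    for j in range(last_crawl + 1, n_u + 1):
        fresh *= 1.0 - k[j]
    return 1.0 - fresh


# Hypothetical page: potential updates at times 2, 4, 6; one crawl at time 4.
u = [0.0, 2.0, 4.0, 6.0]
k = [0.0, 0.5, 0.5, 0.5]
y = [0, 0, 1, 0]
values = [staleness_probability(u, k, y, t) for t in (1.0, 3.0, 5.0, 7.0)]
```

In this example the staleness is constant within each interval, rises after the uncrawled update at time 2, drops back to zero immediately after the crawl at time 4, and rises again after time 6.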

[0068] The invention then computes the corresponding time-average probability estimate as $$\bar{a}(y_{i,0},\ldots,y_{i,Q_i}) = \sum_{j=0}^{Q_i} u_{i,j}\left[1 - \prod_{k=J_{i,j}+1}^{J} (1-k_{i,j})\right]. \qquad (11)$$

[0069] The present invention chooses the nearly optimal x_(i) crawl times as shown in FIG. 6.

[0070] First, in step 601, k is initialized to 1. In step 602, j is initialized to 0, and in step 603, y_(i,j) is initialized to 0. In step 604, j is incremented, and in step 605, it is tested against Q_(i).

[0071] If j≦Q_(i), control returns back to step 603; otherwise, it proceeds to step 606, where m is initialized to 0. In step 607, the value o of the objective function is computed. In step 608, j is initialized to 1, and in step 609 the value y_(i,j) is tested.

[0072] If the value y_(i,j) equals 0, control passes to step 614; otherwise, control continues to step 610. In step 610, the value O of the objective function is computed. In step 611, there is a test to see if O−o>m. If it is, in step 612, m is set equal to O−o, and in step 613, J is set equal to j.

[0073] Next, in step 614, j is incremented. In step 615, j is tested against Q_(i). If j≦Q_(i), then control returns back to step 609; otherwise, it proceeds with step 616, which sets y_(i,J) to 1. Then k is incremented in step 617, and tested against x_(i) in step 618. If k≦x_(i), control returns back to step 602. Otherwise, it halts with the proper values of y_(i,j) set to 1.
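The general idea of FIG. 6, adding one crawl at a time at the potential update time that helps most, can be sketched in Python as follows. This is a simplified greedy reading and not a step-by-step transcription of the flow diagram. It rests on several assumptions: the objective is a time average in which each interval between consecutive potential update times is weighted by its length and the last interval ends at T, the page is fresh at time 0, and a crawl at u_(i,j) captures any update occurring at that time. All function names and data are hypothetical.

```python
def time_average_staleness(u, k, y, T):
    """Piecewise-constant staleness averaged over [0, T] (assumed objective)."""
    total, fresh = 0.0, 1.0
    for j in range(len(u)):
        if y[j] == 1:
            fresh = 1.0              # a crawl at u[j] restores freshness
        elif j > 0:
            fresh *= 1.0 - k[j]      # an uncrawled update may have occurred
        end = u[j + 1] if j + 1 < len(u) else T
        total += (end - u[j]) * (1.0 - fresh)
    return total / T


def greedy_crawl_times(u, k, x, T):
    """Choose x crawl times one by one, each giving the largest reduction."""
    y = [0] * len(u)
    for _ in range(min(x, len(u))):
        current = time_average_staleness(u, k, y, T)
        best_gain, best_j = -1.0, None
        for j in range(len(u)):
            if y[j] == 1:
                continue             # already selected
            y[j] = 1                 # tentatively crawl at u[j]
            gain = current - time_average_staleness(u, k, y, T)
            y[j] = 0
            if gain > best_gain:
                best_gain, best_j = gain, j
        y[best_j] = 1
    return y


# Hypothetical page: likely updates at times 2 and 6, an unlikely one at 4.
u = [0.0, 2.0, 4.0, 6.0]
k = [0.0, 0.9, 0.1, 0.9]
chosen = greedy_crawl_times(u, k, x=2, T=8.0)
```

For this example the two crawls land on the two high-probability update times. Because the procedure is greedy, it is described in the text as nearly optimal rather than optimal.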

[0074] 4. Solving the Multiple Web Page Crawl Allocation Problem

[0075] As mentioned, the present invention finds the minimal value of $$\sum_{i=1}^{N} w_i A_i(x_i)$$

[0076] subject to the constraints A_(i)(x_(i))=a_(i)(t_(i,1)*, . . . , t_(i,x_(i))*) and $$\sum_{i=1}^{N} x_i = R.$$

[0077] In various embodiments of the invention this can be accomplished as shown in FIG. 7.

[0078] In step 701, the value of i is initialized to 1, and in step 702, the value of j is also initialized to 1. In step 703, the value of D_(i,j) is defined to be the first difference: D_(i,j)=F_(i)(j+1)−F_(i)(j). In step 704, the value of j is incremented, and in step 705, the new value of j is tested.

[0079] If j≦R, control returns back to step 703; otherwise, it proceeds to step 706, where i is incremented. In step 707, the new value of i is tested. If i≦N, control returns back to step 702; otherwise, it proceeds to step 708, where r is initialized to 0. In step 709, i is initialized to 1. In step 710, x_(i) is initialized to m_(i), and in step 711, r is incremented by x_(i). In step 712, i is incremented and in step 713 the new value of i is tested.

[0080] If i≦N, control returns back to step 710. Otherwise it proceeds to step 714 where v is initialized to ∞ (that is, set to a sufficiently large value). In step 715, i is initialized to 1. In step 716, x_(i) is tested against M_(i). If x_(i)<M_(i), then the invention proceeds to step 717, where D_(i)(x_(i)+1) is tested against v. If D_(i)(x_(i)+1)<v, then control proceeds to step 718, where v is set to D_(i)(x_(i)+1). In step 719, I is set to i. In step 720, i is incremented. (This step can also be reached from step 716 if x_(i)≧M_(i) and from step 717 if D_(i)(x_(i)+1)≧v). In step 721, i is tested against N. If i≦N, control returns back to step 716; otherwise, it proceeds to step 722, where x_(I) is incremented. In step 723, r is incremented and in step 724, it is tested against R. If r<R, control returns back to step 714. Otherwise, it halts with the desired solution.
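The marginal-allocation loop of FIG. 7 can be sketched in Python. The text does not define F_(i); the sketch assumes F_(i)(x)=w_(i)A_(i)(x) with the exponential form of equation (8), and it computes first differences on demand rather than tabulating them in advance. Greedy allocation of this kind is known to be exact for separable objectives when each F_(i) is convex, which is assumed here. The function names and inputs are hypothetical.

```python
import math


def A_exp(x, lam, T):
    """Equation (8): optimal staleness for x equally spaced crawls."""
    return 1.0 + (x + 1) / (lam * T) * (math.exp(-lam * T / (x + 1)) - 1.0)


def allocate_crawls(weights, rates, m, M, R, T):
    """Start every page at its minimum m[i], then hand out the remaining
    crawls one at a time to the page with the smallest (most negative)
    first difference of F_i(x) = w_i * A_i(x), respecting the maxima M[i]."""
    def F(i, x):
        return weights[i] * A_exp(x, rates[i], T)

    x = list(m)
    r = sum(x)
    while r < R:
        best_i, best_d = None, math.inf
        for i in range(len(x)):
            if x[i] < M[i]:
                d = F(i, x[i] + 1) - F(i, x[i])
                if d < best_d:
                    best_i, best_d = i, d
        if best_i is None:
            break                    # every page is at its maximum
        x[best_i] += 1
        r += 1
    return x


# Hypothetical two-page examples with T = 10 and update rate 1 for both pages.
same = allocate_crawls([1.0, 1.0], [1.0, 1.0], [0, 0], [10, 10], R=4, T=10.0)
skewed = allocate_crawls([5.0, 1.0], [1.0, 1.0], [0, 0], [10, 10], R=4, T=10.0)
capped = allocate_crawls([5.0, 1.0], [1.0, 1.0], [0, 0], [3, 10], R=6, T=10.0)
```

In these examples, equal weights split the crawls evenly, a five-fold weight draws all four crawls to the heavier page, and a maximum of three crawls on that page pushes the remainder to the other page.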

[0081] 5. Solving the Crawler Scheduling Problem

[0082] Given that we know how many crawls should be made for each Web page, the question now becomes how to best schedule the crawls over a scheduling interval of length T. (Again, we shall think in terms of scheduling intervals of length T. We are trying to optimally schedule the current scheduling interval using some information from the last one). We shall assume that there are C possibly heterogeneous crawlers, and that each crawler k can handle S_(k) crawl tasks in time T. Thus we can say that the total number of crawls in time T is R=Σ_(k=1)^(C) S_(k). We shall make one simplifying assumption that each crawl on crawler k takes approximately the same amount of time. Thus, we can divide the time interval T into S_(k) equal size time slots, and estimate the start time of the lth slot on crawler k by T_(kl)=(l−1)T/S_(k) for each 1≦l≦S_(k) and 1≦k≦C.
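The slot structure can be illustrated with a short Python sketch. It assumes that dividing T into S_(k) equal slots means the lth slot on crawler k starts at (l−1)·T/S_(k); the function name and the sample capacities are hypothetical.

```python
def slot_start_times(T, slots_per_crawler):
    """Return, for each crawler k, the start times of its S_k equal slots.

    Assumes slot l (1-based) on crawler k starts at (l - 1) * T / S_k.
    """
    return [
        [(l - 1) * T / s_k for l in range(1, s_k + 1)]
        for s_k in slots_per_crawler
    ]


# Hypothetical setup: T = 8, a slow crawler with 2 slots, a fast one with 4.
starts = slot_start_times(8.0, [2, 4])
total_crawls = sum(len(row) for row in starts)   # R = sum of S_k
```

Here the slower crawler has slots starting at times 0 and 4, the faster one at times 0, 2, 4 and 6, and R is 6.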

[0083] We know from the previous section the desired number of crawls x_(i)* for each Web page i. Since we have already computed the optimal schedule for the last scheduling interval, we further know the start time t_(i,0) of the final crawl for Web page i within the last scheduling interval. Thus we can compute the optimal crawl times t_(i,1)*, . . . , t_(i,x_(i))* for Web page i during the scheduling interval. For the stochastic case, it is important for the scheduler to initiate each of these crawl tasks at approximately the proper time, but being a bit early or a bit late should have no serious impact for most of the update probability distribution functions we envision. Thus it is reasonable to assume a scheduler cost function for the jth crawl of page i, whose update patterns follow a stochastic process, that takes the form S(t)=|t−t_(i,j)*|. On the other hand, for a Web page i whose update patterns follow a quasi-deterministic process, being a bit late is acceptable, but being early is not useful. So an appropriate scheduler cost function for the jth crawl of a quasi-deterministic page i might have the form $$S(t) = \begin{cases} \infty, & \text{if } t < t_{i,j}^{*}; \\ t - t_{i,j}^{*}, & \text{otherwise.} \end{cases} \qquad (12)$$
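The two scheduler cost functions can be written down directly. The Python sketch below is illustrative; the function names are hypothetical, and `t_star` stands for the ideal crawl time t_(i,j)*.

```python
import math


def cost_stochastic(t, t_star):
    """Symmetric cost |t - t*|: early and late crawls are penalized alike."""
    return abs(t - t_star)


def cost_quasi_deterministic(t, t_star):
    """Cost of the form in equation (12): a crawl before the potential
    update time is useless (infinite cost); lateness is penalized linearly."""
    return math.inf if t < t_star else t - t_star
```

For example, with an ideal time of 3.0, a crawl at 2.0 costs 1.0 for a stochastic page but is ruled out for a quasi-deterministic page, while a crawl at 5.0 costs 2.0 in both cases.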

[0084] The problem can be posed and solved as a transportation problem in a manner described below.

[0085] Define a bipartite network with one directed arc from each supply node to each demand node. The R supply nodes, indexed by j, correspond to the crawls to be scheduled. Each of these nodes has a supply of 1 unit. There will be one demand node per time slot and crawler pair, each of which has a demand of 1 unit. We index these by 1≦l≦S_(k) and 1≦k≦C. The cost of arc jkl emanating from a supply node j to a demand node kl is S_(j)(T_(kl)). FIG. 8 shows the underlying network for an example of this particular transportation problem. Assume that each crawler can crawl the same number S=S_(k) of pages in the scheduling interval T. In the figure, the number of crawls is R=4, which equals the number of crawler time slots. The number of crawlers is C=2, and the number of crawls per crawler is S=2. Hence, R=CS.

[0086] The specific linear optimization problem solved by the transportation problem can be formulated as follows. $$\text{Minimize} \quad \sum_{i=1}^{M}\sum_{j=1}^{N}\sum_{k=1}^{M} R_i(T_{jk})\, f_{ijk} \qquad (13)$$

[0087] such that $$\sum_{i=1}^{M} f_{ijk} = 1 \quad \forall\, 1 \le j \le N \text{ and } 1 \le k \le M, \qquad (14)$$

f_(ijk)≧0 ∀ 1≦i,k≦M and 1≦j≦N.   (15)

[0088] Those skilled in the art will readily appreciate that the solution of a transportation problem can generally be accomplished efficiently. The nature of the transportation problem formulation ensures that there exists an optimal solution with integral flows, and the techniques in the literature find such a solution. This implies that each f_(ijk) is binary. If f_(ijk)=1, then a crawl of Web page i is assigned to the jth crawl of crawler k.

[0089] If it is required to fix or restrict certain crawl tasks from certain crawler slots, this can be easily done. One simply changes the cost of the restricted directed arcs to be infinite. (Fixing a crawl task to a subset of crawler slots is the same as restricting it from the complementary crawler slots).
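A toy instance of the size shown in FIG. 8 (R=4 crawls, C=2 crawlers, S=2 slots each) illustrates both the assignment and the infinite-cost restriction. The Python sketch below deliberately uses brute-force enumeration of all one-to-one assignments, which is practical only for tiny examples and is not the network flow method referred to in the text. The function name, the value of T, and the ideal crawl times are assumptions made for illustration.

```python
import itertools
import math


def schedule_crawls(cost):
    """Return (assignment, total) minimizing the sum of cost[j][slot].

    cost[j][s] is the cost of placing crawl j in slot s; assignment[j] is
    the slot chosen for crawl j. Brute force over all permutations.
    """
    n = len(cost)
    best, best_total = None, math.inf
    for perm in itertools.permutations(range(n)):
        total = sum(cost[j][perm[j]] for j in range(n))
        if total < best_total:
            best, best_total = perm, total
    return best, best_total


# Hypothetical data: T = 8, two crawlers with two slots each.
slot_times = [0.0, 4.0, 0.0, 4.0]     # crawler 1 slots, then crawler 2 slots
ideal_times = [0.5, 1.0, 4.5, 6.0]    # assumed ideal crawl times t*
cost = [[abs(t - t_star) for t in slot_times] for t_star in ideal_times]
assignment, total = schedule_crawls(cost)

# Restrict crawl 0 from both time-0 slots by making those arcs infinite.
restricted = [row[:] for row in cost]
restricted[0][0] = restricted[0][2] = math.inf
assignment_r, total_r = schedule_crawls(restricted)
```

Without restrictions, the two early crawls take the time-0 slots and the two late crawls take the time-4 slots. After the restriction, crawl 0 is forced into a time-4 slot and the total cost rises accordingly.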

[0090] Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the invention.

What is claimed is:
 1. A method for determining search engine embarrassment, comprising: for each of a plurality of Web pages, (a) obtaining information regarding the probability that the Web page is stale and will be returned to and selected by a client, and (b) computing an embarrassment level using the obtained information.
 2. The method of claim 1, wherein computed embarrassment levels are used in formulating a Web crawling schedule.
 3. A system for providing efficient search engine crawling, comprising: a crawler optimizer for determining an optimal number of crawls and crawl times during a predetermined time interval for a predetermined number of Web pages; and a crawler scheduler for determining an optimal achievable crawler schedule for a predetermined number of crawlers, using the determined number of crawls and crawl times.
 4. The system of claim 3, wherein the crawler optimizer determines the optimal number of crawls and crawl times with respect to minimizing average level of embarrassment.
 5. The system of claim 3, wherein the crawler optimizer determines the optimal number of crawls and crawl times using information as to whether Web pages are updated in a stochastic or quasi-deterministic manner.
 6. The system of claim 3, wherein the crawler optimizer is constrained by a minimum number of crawls of Web pages during the predetermined time interval.
 7. The system of claim 3, wherein the crawler optimizer is constrained by a maximum number of crawls of Web pages during the predetermined time interval.
 8. The system of claim 3, wherein the crawler scheduler determines the optimal crawler schedule using a transportation network model.
 9. The system of claim 3, wherein the crawler scheduler is constrained by restricted crawling times for specified Web pages.
 10. A program storage device readable by a machine, tangibly embodying a program of instructions executable on the machine to perform method steps for determining levels of embarrassment, the method steps comprising: for each of a plurality of Web pages, (a) obtaining information regarding the probability that the Web page is stale and will be returned to and selected by a client, and (b) computing an embarrassment level using the obtained information.
 11. The program storage device of claim 10, wherein computed embarrassment levels are used in formulating a Web crawling schedule.
 12. A method for determining a level of embarrassment to a search engine, comprising: determining a level of embarrassment for each of a plurality of Web pages, the level of embarrassment for each of the plurality of Web pages determined according to $$w_i = d_i \sum_{j}\sum_{k} c_{j,k}\, b_{i,j,k}$$

where w_(i) is the level of embarrassment for Web page i, d_(i) is the probability a query to a stale version of w_(i) yields an incorrect response, c_(j,k) is the frequency that a client will click on a returned page in a position j of a query result page k, and b_(i,j,k) is the probability that the Web page i will be returned in the position j of the query result page k.